Data Description:
The actual concrete compressive strength (MPa) for a given mixture at a specific age (days) was determined in the laboratory. The data is in raw form (not scaled) and comprises 8 quantitative input variables, 1 quantitative output variable, and 1030 instances (observations).
Domain: Cement manufacturing
Context:
Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. These ingredients include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate.
Attribute Information:
cement, slag (blast furnace slag), ash (fly ash), water, superplastic (superplasticizer), coarseagg (coarse aggregate) and fineagg (fine aggregate) are measured in kg/m^3 of mixture; age is measured in days (1 to 365); strength, the target, is measured in MPa.
Learning Outcomes:
#Numerical Calculations
import numpy as np
import pandas as pd
from scipy.stats import norm, shapiro, zscore
#Data Visualization
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import matplotlib.cm as cm
plt.style.use('ggplot')
import seaborn as sns
from sklearn import tree
from sklearn.tree import export_graphviz
from IPython.display import Image
from sklearn import set_config
set_config(display='diagram')
#Train, Validation and Test set preparation
from sklearn.model_selection import train_test_split
#Feature Engineering and Data Preprocessing
from sklearn.preprocessing import PolynomialFeatures, QuantileTransformer
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.preprocessing import LabelEncoder
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
#Feature Selection
from sklearn.feature_selection import SelectFromModel
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
#Model Selection
from sklearn.model_selection import validation_curve, learning_curve
from sklearn.model_selection import GridSearchCV, KFold, ShuffleSplit, cross_val_score
#Unsupervised Learning Techniques
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from scipy.spatial.distance import pdist, cdist
from sklearn.metrics import silhouette_score, silhouette_samples
#Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
#Model Building - Regressors
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
import lightgbm as lgb
from sklearn.svm import SVR
#Model Evaluation
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from mlxtend.evaluate import paired_ttest_5x2cv
#Misc
import os
import warnings
warnings.filterwarnings('ignore')
random_state = 24
# from google.colab import drive
# drive.mount('/content/drive')
#os.chdir('/content/drive/My Drive/Colab Notebooks/GL Projects Portfolio/Project 5 - Featurization Model Selection and Tuning')
concrete = pd.read_csv('concrete.csv')
concrete.head()
| | cement | slag | ash | water | superplastic | coarseagg | fineagg | age | strength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 141.3 | 212.0 | 0.0 | 203.5 | 0.0 | 971.8 | 748.5 | 28 | 29.89 |
| 1 | 168.9 | 42.2 | 124.3 | 158.3 | 10.8 | 1080.8 | 796.2 | 14 | 23.51 |
| 2 | 250.0 | 0.0 | 95.7 | 187.4 | 5.5 | 956.9 | 861.2 | 28 | 29.22 |
| 3 | 266.0 | 114.0 | 0.0 | 228.0 | 0.0 | 932.0 | 670.0 | 28 | 45.85 |
| 4 | 154.8 | 183.4 | 0.0 | 193.3 | 9.1 | 1047.4 | 696.7 | 28 | 18.29 |
Checking the shape of the dataset
print('No of Rows : ', concrete.shape[0])
print('No of Columns : ', concrete.shape[1])
No of Rows : 1030 No of Columns : 9
Checking the datatypes and null records
concrete.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1030 entries, 0 to 1029 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 cement 1030 non-null float64 1 slag 1030 non-null float64 2 ash 1030 non-null float64 3 water 1030 non-null float64 4 superplastic 1030 non-null float64 5 coarseagg 1030 non-null float64 6 fineagg 1030 non-null float64 7 age 1030 non-null int64 8 strength 1030 non-null float64 dtypes: float64(8), int64(1) memory usage: 72.5 KB
Checking total number of Null values in the dataset
concrete.isnull().sum()
cement 0 slag 0 ash 0 water 0 superplastic 0 coarseagg 0 fineagg 0 age 0 strength 0 dtype: int64
Checking if the dataset has only Numeric Data
concrete.applymap(np.isreal).all()
cement True slag True ash True water True superplastic True coarseagg True fineagg True age True strength True dtype: bool
Observations:
concrete.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| cement | 1030.0 | 281.167864 | 104.506364 | 102.00 | 192.375 | 272.900 | 350.000 | 540.0 |
| slag | 1030.0 | 73.895825 | 86.279342 | 0.00 | 0.000 | 22.000 | 142.950 | 359.4 |
| ash | 1030.0 | 54.188350 | 63.997004 | 0.00 | 0.000 | 0.000 | 118.300 | 200.1 |
| water | 1030.0 | 181.567282 | 21.354219 | 121.80 | 164.900 | 185.000 | 192.000 | 247.0 |
| superplastic | 1030.0 | 6.204660 | 5.973841 | 0.00 | 0.000 | 6.400 | 10.200 | 32.2 |
| coarseagg | 1030.0 | 972.918932 | 77.753954 | 801.00 | 932.000 | 968.000 | 1029.400 | 1145.0 |
| fineagg | 1030.0 | 773.580485 | 80.175980 | 594.00 | 730.950 | 779.500 | 824.000 | 992.6 |
| age | 1030.0 | 45.662136 | 63.169912 | 1.00 | 7.000 | 28.000 | 56.000 | 365.0 |
| strength | 1030.0 | 35.817961 | 16.705742 | 2.33 | 23.710 | 34.445 | 46.135 | 82.6 |
Observations:
- Except for age, all other predictor features are on the same scale, measured in kg/m^3. age is measured in number of days, whereas strength is measured in MPa.
- slag and ash have skewed distributions with no values within the 25% and 50% quantiles. Their distributions are also sparse: means of 73 and 54 against corresponding standard deviations of 86 and 63 respectively. SD > Mean shows that the variance is skewed towards one of the tails.
- superplastic also has a skewed distribution, with no values in the 1st and 2nd quartiles.
- water seems to be well distributed, with values in all the quantiles and considering the min and max ranges.
- age is a discrete variable with values ranging from 1 to 365 days (max one year). Hence scaling would be required prior to Model Building.
- strength is also well distributed, with sufficient representation in all the quantiles.
def plot_univariate_features(df):
"""
Helper function to plot Univariate features.
Input : Dataframe; Output : All Univariate plots for both Numeric and Categorical variables.
"""
print("Integer Columns = ",df.select_dtypes(include=['int32','int64']).columns)
print("Floating Point Columns = ",df.select_dtypes(include=['float64']).columns)
print("Object Columns = ",df.select_dtypes(include=['object']).columns)
print("Category Columns = ",df.select_dtypes(include=['category']).columns)
#sns.set_style(style='darkgrid')
int_cols = pd.Series(df.select_dtypes(include=['int32','int64']).columns)
for j in range(0,len(int_cols)):
f, axes = plt.subplots(1, 2, figsize=(10, 10))
sns.boxplot(df[int_cols[j]], ax = axes[0], palette='Greens_r')
sns.distplot(df[int_cols[j]], ax = axes[1], fit=norm)
plt.subplots_adjust(top = 1.5, right = 10, left = 8, bottom = 1)
float_cols = pd.Series(df.select_dtypes(include=['float64']).columns)
for j in range(0,len(float_cols)):
plt.Text('Figure for float64')
f, axes = plt.subplots(1, 2, figsize=(10, 10))
sns.boxplot(df[float_cols[j]], ax = axes[0], palette='Greens_r')
sns.distplot(df[float_cols[j]], ax = axes[1], fit=norm)
plt.subplots_adjust(top = 1.5, right = 10, left = 8, bottom = 1)
obj_cols = pd.Series(df.select_dtypes(include=['object']).columns)
for j in range(0,len(obj_cols)):
plt.subplots()
sns.countplot(df[obj_cols[j]])
plot_univariate_features(concrete)
Integer Columns = Index(['age'], dtype='object')
Floating Point Columns = Index(['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg',
'fineagg', 'strength'],
dtype='object')
Object Columns = Index([], dtype='object')
Category Columns = Index([], dtype='object')
Observations:
- cement, ash and coarseagg variables do not have any outliers.
- slag, ash and superplastic variables show strong right skewness. Hence it would be advisable to apply a Feature Transformation such as a Log Transform / Quantile Transform prior to Model Building.
- slag, ash and superplastic variables also have several 0 values in their distributions. It is not advisable to impute them without additional business justification; in this project we keep them as they are.
- Clustering / Gaussian Mixture models will be helpful to analyse and understand more about the cement, water, ash and superplastic variables.
- strength variable is almost Normally distributed.
normality_test = lambda x: shapiro(x.fillna(0))[1] < 0.01
normal = concrete
normal = normal.apply(normality_test)
print(~normal)
cement False slag False ash False water False superplastic False coarseagg False fineagg False age False strength False dtype: bool
None of the variables pass Shapiro test for Normality at 0.01 Significance level.
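For reference, the raw Shapiro-Wilk p-values can be printed directly; a minimal sketch reusing the same concrete DataFrame and the shapiro import above:
# Print the test statistic and p-value per column; p < 0.01 rejects Normality at the 1% level.
for col in concrete.columns:
    stat, p = shapiro(concrete[col])
    print('{:>14} : W = {:.3f}, p = {:.2e}'.format(col, stat, p))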
plt.figure(figsize=(10,10))
concrete.skew().sort_values().plot(kind='barh')
plt.show()
fig = plt.figure(figsize = (20, 15))
ax = sns.boxplot(data = concrete, orient = 'h')
plt.show()
def outliers_IQR(df):
    """
    Helper function to detect Outliers in the Dataframe
    Input : Dataframe; Output : Dataframe containing Percentage and Number of Outliers.
    """
    outliers_list = []
    no_of_outliers = []
    for c in df.columns:
        Q1 = df[c].quantile(0.25)
        Q3 = df[c].quantile(0.75)
        IQR = Q3 - Q1
        # A point is an outlier if it lies beyond 1.5*IQR from either quartile
        df_outliers = np.where((df[c] < (Q1 - 1.5 * IQR)) | (df[c] > (Q3 + 1.5 * IQR)))
        no_of_outliers.append(len(df_outliers[0]))
        outliers_list.append(round((len(df_outliers[0]) / len(df[c]) * 100), 2))
    outliers_df = pd.DataFrame({"Percentage_of_Outliers": outliers_list, "No_of_Outliers": no_of_outliers}, index=df.columns)
    return outliers_df.sort_values(by="Percentage_of_Outliers", ascending=False)
outliers_IQR(concrete)
| | Percentage_of_Outliers | No_of_Outliers |
|---|---|---|
| age | 5.73 | 59 |
| superplastic | 0.97 | 10 |
| water | 0.87 | 9 |
| fineagg | 0.49 | 5 |
| strength | 0.39 | 4 |
| slag | 0.19 | 2 |
| cement | 0.00 | 0 |
| ash | 0.00 | 0 |
| coarseagg | 0.00 | 0 |
Observations:
- age has around 6% of its values as outliers, which is significant; it is also a discrete ordinal variable. Hence, imputation would not be done, as discussed previously.
- superplastic and water exhibit around 1% of the data as outliers.
# Log Transformation of the below features would be carried out in the Feature Engineering step
outlier_treatment_features = ['superplastic', 'fineagg', 'slag']
from statsmodels.graphics.regressionplots import influence_plot
import statsmodels.api as sm
X_sm = concrete.iloc[:,0:-1]
y_sm = concrete['strength']
X_sm = sm.add_constant(X_sm) # adding a constant
lr_sm = sm.OLS(y_sm, X_sm).fit()
pred_sm = lr_sm.predict(X_sm)
print_model = lr_sm.summary()
print(print_model)
OLS Regression Results
==============================================================================
Dep. Variable: strength R-squared: 0.616
Model: OLS Adj. R-squared: 0.613
Method: Least Squares F-statistic: 204.3
Date: Sat, 21 Nov 2020 Prob (F-statistic): 6.29e-206
Time: 22:01:51 Log-Likelihood: -3869.0
No. Observations: 1030 AIC: 7756.
Df Residuals: 1021 BIC: 7800.
Df Model: 8
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
const -23.3312 26.586 -0.878 0.380 -75.500 28.837
cement 0.1198 0.008 14.113 0.000 0.103 0.136
slag 0.1039 0.010 10.247 0.000 0.084 0.124
ash 0.0879 0.013 6.988 0.000 0.063 0.113
water -0.1499 0.040 -3.731 0.000 -0.229 -0.071
superplastic 0.2922 0.093 3.128 0.002 0.109 0.476
coarseagg 0.0181 0.009 1.926 0.054 -0.000 0.037
fineagg 0.0202 0.011 1.887 0.059 -0.001 0.041
age 0.1142 0.005 21.046 0.000 0.104 0.125
==============================================================================
Omnibus: 5.378 Durbin-Watson: 1.870
Prob(Omnibus): 0.068 Jarque-Bera (JB): 5.304
Skew: -0.174 Prob(JB): 0.0705
Kurtosis: 3.045 Cond. No. 1.06e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.06e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Observations:
The large condition number (1.06e+05) reported in the notes above indicates strong multicollinearity among the predictor variables.
fig, ax = plt.subplots(figsize=(12,8))
fig = influence_plot(lr_sm, ax= ax, criterion="cooks", size=25)
# Cook's Distance is displayed as size of the data points in below plot.
# Higher the number of large size data points, higher the leverage points.
Setting up a Leverage Cutoff to identify High Leverage points. A common rule of thumb is 3(k+1)/n, where k is the number of columns and n is the number of rows.
leverage_cutoff = 3*((X_sm.shape[1] + 1)/X_sm.shape[0])
print("Leverage Cut-off beyond which points are treated as High Leverage Points are : ", round(leverage_cutoff, 3))
Leverage Cut-off beyond which points are treated as High Leverage Points are : 0.029
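As a sketch (reusing the fitted lr_sm model), the observations exceeding this cutoff can be listed from the hat-matrix diagonal that statsmodels exposes:
# Leverage of each observation is the corresponding diagonal entry of the hat matrix.
influence_sm = lr_sm.get_influence()
leverage = influence_sm.hat_matrix_diag
print('No of High Leverage Points : ', (leverage > leverage_cutoff).sum())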
#Let us visualize few of the data points with high Leverage values but with low t-Residual values
X_sm[X_sm.index.isin([323,462,789,432,914,587])]
| | const | cement | slag | ash | water | superplastic | coarseagg | fineagg | age |
|---|---|---|---|---|---|---|---|---|---|
| 323 | 1.0 | 139.6 | 209.4 | 0.0 | 192.0 | 0.0 | 1047.0 | 806.9 | 360 |
| 432 | 1.0 | 168.0 | 42.1 | 163.8 | 121.8 | 5.7 | 1058.7 | 780.1 | 28 |
| 462 | 1.0 | 168.0 | 42.1 | 163.8 | 121.8 | 5.7 | 1058.7 | 780.1 | 100 |
| 587 | 1.0 | 168.0 | 42.1 | 163.8 | 121.8 | 5.7 | 1058.7 | 780.1 | 3 |
| 789 | 1.0 | 168.0 | 42.1 | 163.8 | 121.8 | 5.7 | 1058.7 | 780.1 | 56 |
| 914 | 1.0 | 168.0 | 42.1 | 163.8 | 121.8 | 5.7 | 1058.7 | 780.1 | 14 |
Since all of these points have a lower Studentised t-residual value, there is no need to remove them from the dataset.
sns.pairplot(concrete, diag_kind='kde');
Observations:
- strength displays a positive relationship with cement.
- strength exhibits non-linear relationships with several predictors, which might be hard for us to model using Linear Algorithms.
predictor_cols = pd.Series(concrete.iloc[:,0:-1].columns)
for j in range(0, len(predictor_cols)):
    g = sns.JointGrid(x=concrete[predictor_cols[j]], y=concrete['strength'], palette='Set2')
    g.plot(sns.regplot, sns.boxplot)
As observed earlier, the strength variable displays a clear positive relationship only with cement.
for j in range(0, len(predictor_cols)):
    sm.qqplot(concrete[predictor_cols[j]], line='r')
Except for cement and water, none of the other variables display similar distributions to one another. We can try to derive a new feature based on these two features.
plt.figure(figsize=(10,10))
sns.heatmap(concrete.corr(), annot=True, linewidths = 0.3, fmt = '0.2f', cmap = 'YlGnBu', square=True)
plt.show()
coarseagg is the one variable that exhibits a negative correlation with every other feature.
Correlation with strength
plt.figure(figsize=(15,15))
concrete.corr()[['strength']].sort_values(by='strength', ascending=True).plot(kind='barh')
plt.show()
plt.figure(figsize=(10,10))
sns.scatterplot(y="strength", x="cement", hue="water", size="age", data=concrete, sizes=(50, 300), palette='YlGnBu_r')
plt.title('Effect of Cement, Water, Age on Strength')
plt.show()
The strength of the concrete increases with the amount of cement (kg/m^3). age (data point size) also seems to have a slightly positive correlation with strength.
def calculate_vif(X):
vif_features = pd.DataFrame()
vif_features["Features"] = X.columns
vif_features["VIF Score"] = [vif(X.values, i) for i in range(X.shape[1])]
return(vif_features)
vif_df = calculate_vif(concrete.iloc[:,0:-1])
vif_df.sort_values(by='VIF Score', ascending=False)
| | Features | VIF Score |
|---|---|---|
| 5 | coarseagg | 84.955779 |
| 3 | water | 82.157569 |
| 6 | fineagg | 72.790995 |
| 0 | cement | 15.456717 |
| 4 | superplastic | 5.471094 |
| 2 | ash | 4.147833 |
| 1 | slag | 3.329127 |
| 7 | age | 1.699459 |
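As a sanity check, a single VIF can be reproduced by hand: regress that feature on the remaining predictors and compute 1/(1-R^2). A minimal sketch for water, mirroring how variance_inflation_factor works internally (OLS without an added constant):
# VIF_i = 1 / (1 - R^2_i), where R^2_i comes from regressing feature i on the rest.
X_vif = concrete.iloc[:, 0:-1]
r2_water = sm.OLS(X_vif['water'], X_vif.drop(columns='water')).fit().rsquared
print('Manual VIF for water : ', round(1 / (1 - r2_water), 2))  # should match the table (~82.16)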
X = concrete.drop(columns='strength')
y = concrete['strength']
display(X.shape, y.shape)
(1030, 8)
(1030,)
def train_val_test_split(X, y, frac_train=0.60, frac_val=0.20, frac_test=0.20, random_state=None):
    '''
    Helper function that splits a dataframe into three subsets (train, val, and test)
    by running train_test_split() twice.
    '''
    if frac_train + frac_val + frac_test != 1.0:
        raise ValueError('fractions %f, %f, %f do not add up to 1.0' %
                         (frac_train, frac_val, frac_test))
    # Split original dataframe into train and temp dataframes.
    df_train, df_temp, y_train, y_temp = train_test_split(X, y,
                                                          test_size=(1.0 - frac_train),
                                                          random_state=random_state)
    # Split the temp dataframe into val and test dataframes.
    relative_frac_test = frac_test / (frac_val + frac_test)
    df_val, df_test, y_val, y_test = train_test_split(df_temp, y_temp,
                                                      test_size=relative_frac_test,
                                                      random_state=random_state)
    assert len(X) == len(df_train) + len(df_val) + len(df_test)
    return df_train, df_val, df_test, y_train, y_val, y_test
X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(
X, y, frac_train=0.6, frac_val=0.2, frac_test=0.2, random_state=24)
X_train_org, X_val_org, X_test_org, y_train_org, y_val_org, y_test_org = X_train, X_val, X_test, y_train, y_val, y_test
display(X_train.shape, y_train.shape)
(618, 8)
(618,)
display(X_val.shape, y_val.shape)
(206, 8)
(206,)
display(X_test.shape, y_test.shape)
(206, 8)
(206,)
#log(1+x) transformation is used since the features has '0' as one of the values
print('Features with Outliers & Skewness : ', outlier_treatment_features)
for fts in outlier_treatment_features:
    X_train[fts] = np.log1p(X_train[fts])
    X_val[fts] = np.log1p(X_val[fts])
    X_test[fts] = np.log1p(X_test[fts])
Features with Outliers & Skewness : ['superplastic', 'fineagg', 'slag']
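Note that log1p is exactly invertible via expm1, which is handy whenever values need to be reported back on the original scale; a quick sketch:
# np.expm1 undoes np.log1p, so the original units can always be recovered.
sample = np.array([0.0, 10.8, 32.2])
assert np.allclose(np.expm1(np.log1p(sample)), sample)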
plot_univariate_features(X_train[outlier_treatment_features])
Integer Columns = Index([], dtype='object') Floating Point Columns = Index(['superplastic', 'fineagg', 'slag'], dtype='object') Object Columns = Index([], dtype='object') Category Columns = Index([], dtype='object')
Domain-specific technical know-how:
From the correlation matrix, we can observe that the correlation between water and strength is -0.29, whereas between cement and strength it is 0.50.
X_train.head(3)
| | cement | slag | ash | water | superplastic | coarseagg | fineagg | age |
|---|---|---|---|---|---|---|---|---|
| 511 | 296.0 | 0.00000 | 0.0 | 186.0 | 0.000000 | 1090.0 | 6.646391 | 28 |
| 1011 | 313.0 | 0.00000 | 0.0 | 178.0 | 2.197225 | 1000.0 | 6.712956 | 28 |
| 90 | 139.6 | 5.34901 | 0.0 | 192.0 | 0.000000 | 1047.0 | 6.694438 | 7 |
water_cement_ratio:
X_train['water_cement_ratio'] = (X_train['water'] / X_train['cement']).round(3)
X_train.head(3)
| | cement | slag | ash | water | superplastic | coarseagg | fineagg | age | water_cement_ratio |
|---|---|---|---|---|---|---|---|---|---|
| 511 | 296.0 | 0.00000 | 0.0 | 186.0 | 0.000000 | 1090.0 | 6.646391 | 28 | 0.628 |
| 1011 | 313.0 | 0.00000 | 0.0 | 178.0 | 2.197225 | 1000.0 | 6.712956 | 28 | 0.569 |
| 90 | 139.6 | 5.34901 | 0.0 | 192.0 | 0.000000 | 1047.0 | 6.694438 | 7 | 1.375 |
plot_univariate_features(X_train[['water_cement_ratio']]);
Integer Columns = Index([], dtype='object') Floating Point Columns = Index(['water_cement_ratio'], dtype='object') Object Columns = Index([], dtype='object') Category Columns = Index([], dtype='object')
X_train = X_train.drop(columns=['water','cement'])
X_train.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| slag | 618.0 | 2.466784 | 2.408843 | 0.000000 | 0.000000 | 2.995732 | 4.964236 | 5.887215 |
| ash | 618.0 | 54.521845 | 64.694942 | 0.000000 | 0.000000 | 0.000000 | 118.300000 | 195.000000 |
| superplastic | 618.0 | 1.448061 | 1.153535 | 0.000000 | 0.000000 | 2.001480 | 2.406945 | 3.502550 |
| coarseagg | 618.0 | 975.311812 | 77.715672 | 801.000000 | 932.000000 | 971.100000 | 1029.400000 | 1145.000000 |
| fineagg | 618.0 | 6.648632 | 0.106110 | 6.388561 | 6.603537 | 6.660703 | 6.714837 | 6.901335 |
| age | 618.0 | 48.621359 | 65.955284 | 3.000000 | 14.000000 | 28.000000 | 56.000000 | 365.000000 |
| water_cement_ratio | 618.0 | 0.746754 | 0.314106 | 0.267000 | 0.542000 | 0.667000 | 0.935000 | 1.882000 |
#Including the Derived Feature in Validation and Test sets independently of Train set
X_val['water_cement_ratio'] = (X_val['water'] / X_val['cement']).round(3)
X_test['water_cement_ratio'] = (X_test['water'] / X_test['cement']).round(3)
X_val = X_val.drop(columns=['water','cement'])
X_test = X_test.drop(columns=['water','cement'])
X_train['superplastic'].value_counts()
0.000000 228
2.533697 23
2.197225 16
2.079442 14
2.174752 9
...
2.639057 1
2.442347 1
2.517696 1
1.609438 1
2.791165 1
Name: superplastic, Length: 99, dtype: int64
X_train[(X_train['ash']==0) & (X_train['superplastic']==0)]
| | slag | ash | superplastic | coarseagg | fineagg | age | water_cement_ratio |
|---|---|---|---|---|---|---|---|
| 511 | 0.000000 | 0.0 | 0.0 | 1090.0 | 6.646391 | 28 | 0.628 |
| 90 | 5.349010 | 0.0 | 0.0 | 1047.0 | 6.694438 | 7 | 1.375 |
| 896 | 0.000000 | 0.0 | 0.0 | 940.6 | 6.667720 | 28 | 0.489 |
| 335 | 0.000000 | 0.0 | 0.0 | 974.0 | 6.654153 | 90 | 0.580 |
| 468 | 0.000000 | 0.0 | 0.0 | 1111.0 | 6.665684 | 28 | 0.734 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 850 | 3.881564 | 0.0 | 0.0 | 932.0 | 6.388561 | 270 | 0.533 |
| 1007 | 5.262690 | 0.0 | 0.0 | 932.0 | 6.577583 | 3 | 0.667 |
| 207 | 5.252273 | 0.0 | 0.0 | 932.0 | 6.508769 | 180 | 1.200 |
| 145 | 4.817051 | 0.0 | 0.0 | 959.2 | 6.685861 | 90 | 1.107 |
| 343 | 5.674010 | 0.0 | 0.0 | 998.2 | 6.558623 | 28 | 0.960 |
223 rows × 7 columns
age Feature
X_train_fe = X_train.copy(deep=True)
X_train_fe['age'].hist(bins=50)
<AxesSubplot:>
bins = [0,30,60,150,300,365]
age_groups = ['0_30days','30_60days','60_150days','150_300days','beyond_300days']
X_train_fe['age_group'] = pd.cut(X_train_fe['age'], bins, labels=age_groups)
X_train_fe[['age_group']].describe().T
| | count | unique | top | freq |
|---|---|---|---|---|
| age_group | 618 | 5 | 0_30days | 441 |
X_train_fe.head(5)
| | slag | ash | superplastic | coarseagg | fineagg | age | water_cement_ratio | age_group |
|---|---|---|---|---|---|---|---|---|
| 511 | 0.00000 | 0.0 | 0.000000 | 1090.0 | 6.646391 | 28 | 0.628 | 0_30days |
| 1011 | 0.00000 | 0.0 | 2.197225 | 1000.0 | 6.712956 | 28 | 0.569 | 0_30days |
| 90 | 5.34901 | 0.0 | 0.000000 | 1047.0 | 6.694438 | 7 | 1.375 | 0_30days |
| 1006 | 0.00000 | 195.0 | 2.484907 | 898.0 | 6.570883 | 28 | 1.392 | 0_30days |
| 512 | 0.00000 | 112.6 | 2.406945 | 925.3 | 6.664281 | 28 | 0.541 | 0_30days |
age_group (Transformed Feature)
# Replace the categorical labels with Ordinal values
X_train_fe['age_group'] = X_train_fe['age_group'].replace({
'0_30days':1, '30_60days':2,'60_150days':3,'150_300days':4,'beyond_300days':5})
X_train_fe.head()
| | slag | ash | superplastic | coarseagg | fineagg | age | water_cement_ratio | age_group |
|---|---|---|---|---|---|---|---|---|
| 511 | 0.00000 | 0.0 | 0.000000 | 1090.0 | 6.646391 | 28 | 0.628 | 1 |
| 1011 | 0.00000 | 0.0 | 2.197225 | 1000.0 | 6.712956 | 28 | 0.569 | 1 |
| 90 | 5.34901 | 0.0 | 0.000000 | 1047.0 | 6.694438 | 7 | 1.375 | 1 |
| 1006 | 0.00000 | 195.0 | 2.484907 | 898.0 | 6.570883 | 28 | 1.392 | 1 |
| 512 | 0.00000 | 112.6 | 2.406945 | 925.3 | 6.664281 | 28 | 0.541 | 1 |
X_train_fe.drop(columns='age', inplace=True)
X_train_fe.sample(3)
| | slag | ash | superplastic | coarseagg | fineagg | water_cement_ratio | age_group |
|---|---|---|---|---|---|---|---|
| 855 | 4.675629 | 0.0 | 2.975530 | 936.0 | 6.690470 | 0.356 | 1 |
| 394 | 3.933784 | 173.5 | 2.014903 | 1006.2 | 6.677713 | 0.950 | 1 |
| 27 | 4.675629 | 0.0 | 2.862201 | 852.1 | 6.789084 | 0.361 | 3 |
Note: the age variable is binned here; however, it is not used in the final model, since binning results in a significant loss of information and degraded model performance.
#Repeating the Feature Engineering steps for Validation & Test sets independently
# X_val['age_group'] = pd.cut(X_val['age'], bins, labels=age_groups)
# X_val['age_group'] = X_val['age_group'].replace({
# '0_30days':1, '30_60days':2,'60_150days':3,'150_300days':4,'beyond_300days':5})
# X_test['age_group'] = pd.cut(X_test['age'], bins, labels=age_groups)
# X_test['age_group'] = X_test['age_group'].replace({
# '0_30days':1, '30_60days':2,'60_150days':3,'150_300days':4,'beyond_300days':5})
# X_val.drop(columns='age', inplace=True)
# X_test.drop(columns='age', inplace=True)
scaler = RobustScaler(copy=True)
cols = X_train.columns
# Fit on Train set
scaler.fit(X_train.values)
# Transform Train, Validation and Test sets
X_train = scaler.transform(X_train.values)
X_val = scaler.transform(X_val.values)
X_test = scaler.transform(X_test.values)
X_train = pd.DataFrame(X_train, columns=cols)
X_train.head(3)
| | slag | ash | superplastic | coarseagg | fineagg | age | water_cement_ratio |
|---|---|---|---|---|---|---|---|
| 0 | -1.024883 | -0.843435 | -1.256342 | 1.476936 | -0.021143 | -0.312910 | -0.378376 |
| 1 | -1.024883 | -0.843435 | 0.649977 | 0.317931 | 0.606693 | -0.312910 | -0.566363 |
| 2 | 1.197488 | -0.843435 | -1.256342 | 0.923189 | 0.432036 | -0.631565 | 2.001727 |
X_val = pd.DataFrame(X_val, columns=cols)
X_test = pd.DataFrame(X_test, columns=cols)
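A quick sanity check on the scaled train set (a sketch): RobustScaler subtracts the median and divides by the IQR, so each transformed column should have a median near 0 and an IQR near 1.
# Medians should be ~0 and IQRs ~1 after RobustScaler.
print(X_train.median().round(3))
print((X_train.quantile(0.75) - X_train.quantile(0.25)).round(3))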
plt.figure(figsize=(10,10))
sns.clustermap(X_train, method='average', metric='euclidean',
dendrogram_ratio=(.1, .2),
cbar_pos=(0, .2, .03, .4),
row_cluster=False, cmap="mako")
plt.title("Cluster Map")
plt.show()
Finding the Optimal Number of Clusters
clusters=range(1,10)
meanDistortions=[]
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(X_train)
    prediction = model.predict(X_train)
    meanDistortions.append(sum(np.min(cdist(X_train, model.cluster_centers_, 'euclidean'), axis=1)) / X_train.shape[0])
plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
Text(0.5, 1.0, 'Selecting k with the Elbow Method')
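Equivalently, KMeans already exposes the within-cluster sum of squared distances via its inertia_ attribute, so the manual cdist computation can be avoided; a sketch:
# inertia_ is the sum of squared distances of samples to their closest cluster centre.
inertias = [KMeans(n_clusters=k, random_state=24).fit(X_train).inertia_ for k in clusters]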
Observations:
range_n_clusters = [2,3,4,5,6]
for n_clusters in range_n_clusters:
    fig, (ax1) = plt.subplots(1)
    fig.set_size_inches(12, 7)
    ax1.set_xlim([-0.1, 1])
    ax1.set_ylim([0, len(X_train) + (n_clusters + 1) * 10])
    clusterer = KMeans(n_clusters=n_clusters, random_state=24)
    cluster_labels = clusterer.fit_predict(X_train)
    silhouette_avg = silhouette_score(X_train, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg.round(3))
    sample_silhouette_values = silhouette_samples(X_train, cluster_labels)
    y_lower = 10
    for i in range(n_clusters):
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        y_lower = y_upper + 10  # 10 for the 0 samples
    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    ax1.axvline(x=silhouette_avg, color="blue", linestyle="--")
    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    plt.show()
For n_clusters = 2 The average silhouette_score is : 0.243 For n_clusters = 3 The average silhouette_score is : 0.243 For n_clusters = 4 The average silhouette_score is : 0.273 For n_clusters = 5 The average silhouette_score is : 0.299 For n_clusters = 6 The average silhouette_score is : 0.31
Observations:
# Let us first start with K = 4 based on Silhouette Plot
final_model=KMeans(n_clusters=4)
final_model.fit(X_train)
prediction_train = final_model.predict(X_train)
#Append the prediction
X_train["CLUSTER"] = prediction_train
print("Clusters Assigned : \n")
X_train.head()
Clusters Assigned :
| | slag | ash | superplastic | coarseagg | fineagg | age | water_cement_ratio | CLUSTER |
|---|---|---|---|---|---|---|---|---|
| 0 | -1.024883 | -0.843435 | -1.256342 | 1.476936 | -0.021143 | -0.312910 | -0.378376 | 1 |
| 1 | -1.024883 | -0.843435 | 0.649977 | 0.317931 | 0.606693 | -0.312910 | -0.566363 | 3 |
| 2 | 1.197488 | -0.843435 | -1.256342 | 0.923189 | 0.432036 | -0.631565 | 2.001727 | 2 |
| 3 | -1.024883 | 2.173152 | 0.899570 | -0.995609 | -0.733317 | -0.312910 | 2.055893 | 3 |
| 4 | -1.024883 | 0.898450 | 0.831931 | -0.644044 | 0.147602 | -0.312910 | -0.655577 | 3 |
X_train.CLUSTER.value_counts().sort_index()
0 130 1 155 2 142 3 191 Name: CLUSTER, dtype: int64
X_train.groupby(['CLUSTER']).mean()
| CLUSTER | slag | ash | superplastic | coarseagg | fineagg | age | water_cement_ratio |
|---|---|---|---|---|---|---|---|
| 0 | 0.766503 | -0.424338 | 0.891671 | -0.695397 | 0.037921 | -0.222916 | -0.907165 |
| 1 | -0.807766 | -0.837547 | -1.182082 | 0.498270 | -0.271120 | 0.639238 | -0.538694 |
| 2 | 1.105508 | -0.263421 | -0.399203 | -0.248323 | -0.313169 | -0.250718 | 1.200910 |
| 3 | -0.688082 | 1.164343 | 0.649174 | 0.253568 | 0.427036 | -0.180633 | 0.161779 |
X_train.boxplot(by='CLUSTER', layout = (2,4),figsize=(15,10));
Observations on the different Clusters:
- water_cement_ratio is increasing in ascending order across clusters 0, 1, 3 and 2.
- For ash and slag, the 0 values observed in Univariate Analysis fall into a single cluster (Cluster 1).
prediction_val = final_model.predict(X_val)
prediction_test = final_model.predict(X_test)
#Append the prediction to Validation and Test sets
X_val["CLUSTER"] = prediction_val
X_test["CLUSTER"] = prediction_test
Let us plot the Clusters by considering two of the major features age and water_cement_ratio
sns.jointplot(data=X_train, x='age', y='water_cement_ratio', hue='CLUSTER', kind='scatter', height=8, ratio=5, color='g');
We can see that the clusters identified by KMeans are not well separated, as several of the data points overlap (KMeans considers only the cluster means).
Gaussian Mixture Models consider the mean and variance (distribution) of the data points.
# Gaussian Mixture Model to Cluster based on Distribution of Data
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=4, random_state=18)
data = X_train[['age', 'water_cement_ratio']]
#data = X_train[['age', 'water_cement_ratio', 'superplastic', 'ash']]
gmm.fit(data)
GaussianMixture(n_components=4, random_state=18)
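Unlike KMeans, a GMM yields soft assignments: predict_proba returns each sample's posterior probability of belonging to every component. A quick sketch on the fitted gmm:
# Posterior cluster-membership probabilities for the first three samples.
probs = gmm.predict_proba(data)
print(probs[:3].round(3))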
#predictions from gmm
labels = gmm.predict(data)
gmmCluster = pd.DataFrame(data)
gmmCluster['cluster'] = labels
gmmCluster.columns = ['age', 'water_cement_ratio', 'cluster']
color=['blue','green','orange', 'purple']
for k in range(0, 4):
    data = gmmCluster[gmmCluster["cluster"] == k]
    plt.scatter(data["age"], data["water_cement_ratio"], c=color[k])
plt.show()
gmmCluster.head()
| | age | water_cement_ratio | cluster |
|---|---|---|---|
| 0 | -0.312910 | -0.378376 | 0 |
| 1 | -0.312910 | -0.566363 | 0 |
| 2 | -0.631565 | 2.001727 | 2 |
| 3 | -0.312910 | 2.055893 | 2 |
| 4 | -0.312910 | -0.655577 | 0 |
gmmCluster.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 618.0 | -1.022196e-16 | 1.000810 | -0.692262 | -0.525347 | -0.312910 | 0.111964 | 4.800751 |
| water_cement_ratio | 618.0 | -4.095968e-17 | 1.000810 | -1.528600 | -0.652391 | -0.254114 | 0.599792 | 3.617139 |
| cluster | 618.0 | 1.189320e+00 | 1.297579 | 0.000000 | 0.000000 | 0.000000 | 2.750000 | 3.000000 |
sns.jointplot(
data=gmmCluster,
x='age', y='water_cement_ratio',
hue='cluster',
kind='scatter',
height=8, ratio=5,
palette='Set2'
);
sns.pairplot(data=X_train, hue='CLUSTER', diag_kind='kde')
<seaborn.axisgrid.PairGrid at 0x282b51d9af0>
sns.pairplot(data=gmmCluster, hue='cluster', diag_kind='kde')
<seaborn.axisgrid.PairGrid at 0x282b50c22e0>
sns.lmplot(
x='age',
y='water_cement_ratio',
data=X_train,
hue='CLUSTER');
gmmCluster.boxplot(by='cluster', layout = (1,2),figsize=(10,6));
Observations:
- water_cement_ratio has been clustered in ascending order for clusters 0, 1, 3 and 2 (similar to KMeans).
- The age variable is split across multiple clusters (in line with our expectations from the Univariate Analysis).
Baseline Models, both linear and tree-based, will now be built to check on the Model Complexity.
linr_reg_org = LinearRegression(copy_X=True, n_jobs=-1)
linr_reg_org.fit(X_train_org, y_train_org)
LinearRegression(n_jobs=-1)
linr_reg_org.score(X_train_org, y_train_org)
0.6243664962922777
y_pred_lr_org = linr_reg_org.predict(X_val_org)
rmse_lr_org = np.sqrt(mean_squared_error(y_val_org, y_pred_lr_org))
mae_lr_org = mean_absolute_error(y_val_org, y_pred_lr_org)
r2_lr_org = r2_score(y_val_org, y_pred_lr_org)
print("Model\t\t\t RMSE \t\t MAE \t\t R2")
print("""LinearRegression \t {:.2f} \t\t{:.2f} \t\t{:.2f}""".format(rmse_lr_org, mae_lr_org, r2_lr_org))
Model RMSE MAE R2 LinearRegression 10.42 8.24 0.63
linr_reg = LinearRegression(copy_X=True, n_jobs=-1)
linr_reg.fit(X_train, y_train)
LinearRegression(n_jobs=-1)
linr_reg.score(X_train, y_train)
0.5891697201725308
y_pred_lr = linr_reg.predict(X_val)
for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, linr_reg.coef_[idx]))
The coefficient for slag is 4.445924855745738 The coefficient for ash is -0.7581212306317845 The coefficient for superplastic is 5.986061628412468 The coefficient for coarseagg is -0.04635785013846914 The coefficient for fineagg is -1.223791258383809 The coefficient for age is 6.156905617330509 The coefficient for water_cement_ratio is -8.548560874311438 The coefficient for CLUSTER is -0.5874269914983771
intercept = linr_reg.intercept_
print("The intercept for our model is {}".format(intercept))
The intercept for our model is 36.92204873041482
plt.figure(figsize=(12,6))
sns.barplot(y=X_train.columns, x=linr_reg.coef_, orient='h')
plt.title('Coefficient Plot for Linear Regression')
plt.show()
rmse_lr = np.sqrt(mean_squared_error(y_val, y_pred_lr))
mae_lr = mean_absolute_error(y_val, y_pred_lr)
r2_lr = r2_score(y_val, y_pred_lr)
print("Model\t\t\t RMSE \t\t MAE \t\t R2")
print("""LinearRegression \t {:.2f} \t\t{:.2f} \t\t{:.2f}""".format(rmse_lr, mae_lr, r2_lr))
Model RMSE MAE R2 LinearRegression 10.88 8.73 0.60
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
poly.fit(X_train_org)
PolynomialFeatures(include_bias=False)
X_train_poly = poly.transform(X_train_org)
X_val_poly = poly.transform(X_val_org)
X_train_poly.shape
(618, 54)
print('Polynomial Features of Degree 2 : ',poly.get_feature_names())
Polynomial Features of Degree 2 : ['x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x0^2', 'x0 x1', 'x0 x2', 'x0 x3', 'x0 x4', 'x0 x5', 'x0 x6', 'x0 x7', 'x0 x8', 'x1^2', 'x1 x2', 'x1 x3', 'x1 x4', 'x1 x5', 'x1 x6', 'x1 x7', 'x1 x8', 'x2^2', 'x2 x3', 'x2 x4', 'x2 x5', 'x2 x6', 'x2 x7', 'x2 x8', 'x3^2', 'x3 x4', 'x3 x5', 'x3 x6', 'x3 x7', 'x3 x8', 'x4^2', 'x4 x5', 'x4 x6', 'x4 x7', 'x4 x8', 'x5^2', 'x5 x6', 'x5 x7', 'x5 x8', 'x6^2', 'x6 x7', 'x6 x8', 'x7^2', 'x7 x8', 'x8^2']
X_train_poly = pd.DataFrame(X_train_poly, columns=poly.get_feature_names())
X_val_poly = pd.DataFrame(X_val_poly, columns=poly.get_feature_names())
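For readability, the generated names (x0, x1, ...) can be mapped back to the original columns; get_feature_names() follows the input column order. A sketch, assuming X_train_org still carries the columns that poly was fitted on:
# x0, x1, ... correspond to the input columns, in order.
name_map = dict(zip(['x{}'.format(i) for i in range(X_train_org.shape[1])], X_train_org.columns))
print(name_map)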
linR_poly_reg = LinearRegression(n_jobs=-1)
linR_poly_reg.fit(X_train_poly, y_train)
LinearRegression(n_jobs=-1)
linR_poly_reg.score(X_train_poly, y_train)
0.836611412521677
y_pred__poly_lr = linR_poly_reg.predict(X_val_poly)
linR_poly_reg.coef_[0]
3.4996301570608472
plt.figure(figsize=(26,16))
sns.barplot(x=X_train_poly.columns, y=linR_poly_reg.coef_)
plt.title('Coefficient Plot for Polynomial Linear Regression')
plt.show()
rmse_poly_lr = np.sqrt(mean_squared_error(y_val, y_pred__poly_lr))
mae_poly_lr = mean_absolute_error(y_val, y_pred__poly_lr)
r2_poly_lr = r2_score(y_val, y_pred__poly_lr)
print("Model\t\t\t\t\t RMSE \t\t MAE \t\t R2")
print("""Polynominal Linear Regression \t\t {:.2f} \t\t{:.2f} \t\t{:.2f}""".format(rmse_poly_lr, mae_poly_lr, r2_poly_lr))
Model RMSE MAE R2 Polynominal Linear Regression 7.57 5.88 0.81
svr_reg = SVR(kernel='rbf', degree=5, C=1.0)  # note: degree is ignored for the 'rbf' kernel
svr_reg.fit(X_train, y_train)
SVR(degree=5)
svr_reg.score(X_train, y_train)
0.6266809987007861
y_pred_svr = svr_reg.predict(X_val)
rmse_svr = np.sqrt(mean_squared_error(y_val, y_pred_svr))
mae_svr = mean_absolute_error(y_val, y_pred_svr)
r2_svr = r2_score(y_val, y_pred_svr)
print("Model\t\t RMSE \t\t MAE \t\t R2")
print("""SVR \t\t {:.2f} \t\t{:.2f} \t\t{:.2f}""".format(rmse_svr, mae_svr, r2_svr))
Model RMSE MAE R2 SVR 10.44 8.47 0.63
dt_reg = DecisionTreeRegressor(criterion='mse', random_state=24)
dt_reg.fit(X_train, y_train)
DecisionTreeRegressor(random_state=24)
dt_reg.score(X_train, y_train)
0.9953071090716287
y_pred_dtr = dt_reg.predict(X_val)
rmse_dtr = np.sqrt(mean_squared_error(y_val, y_pred_dtr))
mae_dtr = mean_absolute_error(y_val, y_pred_dtr)
r2_dtr = r2_score(y_val, y_pred_dtr)
print("Model\t\t RMSE \t\t MAE \t\t R2")
print("""DT \t\t {:.2f} \t\t{:.2f} \t\t{:.2f}""".format(rmse_dtr, mae_dtr, r2_dtr))
Model RMSE MAE R2 DT 6.16 4.17 0.87
# Predicted vs Ground-Truth Plot for all the Baseline Models
fig, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4, figsize=(16,4))
ax1.scatter(y_pred_lr, y_val, s=20)
ax1.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)
ax1.set_ylabel("True")
ax1.set_xlabel("Predicted")
ax1.set_title("Linear Regressor")
ax2.scatter(y_pred__poly_lr, y_val, s=20)
ax2.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)
ax2.set_ylabel("True")
ax2.set_xlabel("Predicted")
ax2.set_title("Polynomial Features Linear Regressor")
ax3.scatter(y_pred_svr, y_val, s=20)
ax3.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)
ax3.set_ylabel("True")
ax3.set_xlabel("Predicted")
ax3.set_title("RBF Kernel Regressor")
ax4.scatter(y_pred_dtr, y_val, s=20)
ax4.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)
ax4.set_ylabel("True")
ax4.set_xlabel("Predicted")
ax4.set_title("Tree Regressor")
fig.suptitle("Predicted vs Ground-Truth Plot\n\n")
fig.tight_layout(rect=[0, 0.03, 1, 0.95])
Observations:
#Checking the residuals of each Predictor
#Ideal scenario - No visible structure should be seen in lowess smoother
#Checking the important features `water_cement_ratio` and `age`
fig = plt.figure(figsize=(8,6))
sns.residplot(x= X_val['water_cement_ratio'], y= y_val, color='green', lowess=True )
fig = plt.figure(figsize=(8,6))
sns.residplot(x= X_val['age'], y= y_val, color='blue', lowess=True )
fig = plt.figure(figsize=(8,6))
sns.residplot(x= X_val['superplastic'], y= y_val, color='orange', lowess=True )
fig = plt.figure(figsize=(8,6))
sns.residplot(x= X_val['slag'], y= y_val, color='brown', lowess=True )
<AxesSubplot:xlabel='slag', ylabel='strength'>
#Helper function to plot Regression Plots for Predictors vs Target with given Polynomial Degree
def fit_polynomial(n_degree):
    p = np.polyfit(X_train['water_cement_ratio'], y_train, deg=n_degree)
    X_train['fit'] = np.polyval(p, X_train['water_cement_ratio'])
    sns.regplot(X_train['water_cement_ratio'], y_train, fit_reg=False)
    return plt.plot(X_train['water_cement_ratio'], X_train['fit'], label='fit')
f, ((ax1,ax2,ax3)) = plt.subplots(1, 3, figsize = (15, 8))
f.suptitle('Fitting different degree of Polynomials to conclude on Model Complexity', fontsize = 14)
plt.subplot(ax1)
fit_polynomial(1)
plt.title('Degree=1')
plt.subplot(ax2)
fit_polynomial(2)
plt.title('Degree=2')
plt.subplot(ax3)
fit_polynomial(10)
plt.title('Degree=10')
X_train = X_train.drop(columns='fit')
rmse_df = pd.DataFrame(columns=['degree', 'rmse_train', 'rmse_val'])
def get_rmse(y, y_fit):
    return np.sqrt(mean_squared_error(y, y_fit))
for i in range(1, 20):
    p = np.polyfit(X_train['water_cement_ratio'], y_train, deg=i)
    rmse_df.loc[i-1] = [i,
                        get_rmse(y_train, np.polyval(p, X_train['water_cement_ratio'])),
                        get_rmse(y_val, np.polyval(p, X_val['water_cement_ratio']))
                       ]
rmse_df
| | degree | rmse_train | rmse_val |
|---|---|---|---|
| 0 | 1.0 | 14.065837 | 15.293120 |
| 1 | 2.0 | 13.635942 | 14.593372 |
| 2 | 3.0 | 13.409363 | 14.315404 |
| 3 | 4.0 | 13.382672 | 14.349907 |
| 4 | 5.0 | 13.381700 | 14.350334 |
| 5 | 6.0 | 13.340273 | 14.368166 |
| 6 | 7.0 | 13.336238 | 14.391429 |
| 7 | 8.0 | 13.330018 | 14.396066 |
| 8 | 9.0 | 13.320632 | 14.449220 |
| 9 | 10.0 | 13.308120 | 14.446808 |
| 10 | 11.0 | 13.306543 | 14.444971 |
| 11 | 12.0 | 13.293541 | 14.501509 |
| 12 | 13.0 | 13.248190 | 14.713410 |
| 13 | 14.0 | 13.248188 | 14.713184 |
| 14 | 15.0 | 13.234598 | 14.679688 |
| 15 | 16.0 | 13.228791 | 14.702419 |
| 16 | 17.0 | 13.228104 | 14.706898 |
| 17 | 18.0 | 13.217768 | 14.776444 |
| 18 | 19.0 | 13.217750 | 14.774522 |
#Plotting the rmse for both train and validation set for different degrees of Polynomial
plt.plot(rmse_df['degree'], rmse_df['rmse_train'], label='RMSE Train', color='b')
plt.plot(rmse_df['degree'], rmse_df['rmse_val'], label='RMSE Validation', color='g')
plt.legend(bbox_to_anchor=(1.05,1), loc=2, borderaxespad=0.);
plt.xlabel('Model Degrees')
plt.ylabel('RMSE')
Text(0, 0.5, 'RMSE')
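The degree minimising validation RMSE can also be read off the table programmatically; a sketch:
# Pick the polynomial degree with the lowest validation RMSE.
best_row = rmse_df.loc[rmse_df['rmse_val'].idxmin()]
print('Best degree by validation RMSE : ', int(best_row['degree']))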
Let us use RandomForestRegressor since tree-based baseline model has been the best of the lot.
rf_reg = RandomForestRegressor(
n_estimators=100,
criterion='mse',
bootstrap=True,
oob_score=True,
n_jobs=-1,
random_state=24
)
rf_reg.fit(X_train, y_train)
RandomForestRegressor(n_jobs=-1, oob_score=True, random_state=24)
y_pred_rfr = rf_reg.predict(X_val)
rf_reg.score(X_train, y_train)
0.982042159608117
rf_reg.oob_score_
0.8969945881526643
rmse_rfr = np.sqrt(mean_squared_error(y_val, y_pred_rfr))
mae_rfr = mean_absolute_error(y_val, y_pred_rfr)
r2_rfr = r2_score(y_val, y_pred_rfr)
print("Model\t\t\t RMSE \t\t MAE \t\t R2")
print("""RF Regressor \t\t {:.2f} \t\t{:.2f} \t\t{:.2f}""".format(rmse_rfr, mae_rfr, r2_rfr))
Model RMSE MAE R2 RF Regressor 5.43 3.85 0.90
def plot_feature_importance(reg_model):
    feature_rank = pd.DataFrame({
        'feature': X_train.columns,
        'importance': reg_model.feature_importances_.round(3)
    })
    feature_rank = feature_rank.sort_values('importance', ascending=False)
    print("Amount of variance explained by the predictors\n")
    feature_rank['cumulative_sum'] = feature_rank['importance'].cumsum() * 100
    plt.figure(figsize=(8, 6))
    sns.barplot(y='feature', x='importance', data=feature_rank)
    plt.title('Feature Importances Plot')
    plt.xlabel('Feature Importance from Regressor')
    plt.show()
    return feature_rank
plot_feature_importance(rf_reg)
Amount of variance explained by the predictors
| | feature | importance | cumulative_sum |
|---|---|---|---|
| 5 | age | 0.348 | 34.8 |
| 6 | water_cement_ratio | 0.347 | 69.5 |
| 7 | CLUSTER | 0.109 | 80.4 |
| 0 | slag | 0.063 | 86.7 |
| 2 | superplastic | 0.047 | 91.4 |
| 4 | fineagg | 0.039 | 95.3 |
| 3 | coarseagg | 0.036 | 98.9 |
| 1 | ash | 0.012 | 100.1 |
feature_selected_by_fi = ['water_cement_ratio', 'age', 'slag', 'superplastic']
from sklearn.inspection import permutation_importance
result = permutation_importance(rf_reg, X_train, y_train, n_repeats=10, random_state=42, n_jobs=-1)
sorted_idx = result.importances_mean.argsort()
plt.figure(figsize=(16,8))
plt.boxplot(result.importances[sorted_idx].T, vert=False, labels=X_train.columns[sorted_idx])
plt.title("Permutation Importances")
plt.show()
sfm = SelectFromModel(estimator=rf_reg)
sfm.fit(X_train, y_train)
X_train_sfm = sfm.transform(X_train)
support = sfm.get_support()
col = X_train.columns
feature_selected_by_sfm = [x for x, y in zip(col, support) if y == True]
print(feature_selected_by_sfm)
['age', 'water_cement_ratio']
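SelectFromModel keeps the features whose importance exceeds a threshold, which defaults to the mean importance; the value it actually used is available after fitting. A sketch:
# The default threshold is the mean feature importance of the estimator.
print('Threshold used : ', round(float(sfm.threshold_), 4))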
sfs = SFS(rf_reg,
k_features=4,
forward=True,
floating=False,
scoring='r2',
verbose=2,
cv=5)
sfs = sfs.fit(X_train, y_train)
plot_sfs(sfs.get_metric_dict(), kind='std_dev')
plt.ylim([0.0, 1])
plt.title('Sequential Forward Selection (w. StdDev)')
plt.show()
[2020-11-21 22:02:35] Features: 1/4 -- score: 0.3790929450847576
[2020-11-21 22:02:40] Features: 2/4 -- score: 0.6777904747692061
[2020-11-21 22:02:44] Features: 3/4 -- score: 0.8258205913263741
[2020-11-21 22:02:48] Features: 4/4 -- score: 0.8602221848536373
columnList = list(X_train.columns)
feat_cols = list(sfs.k_feature_idx_)
print(feat_cols)
feature_selected_by_sfs = [columnList[i] for i in feat_cols]
print(feature_selected_by_sfs)
[0, 1, 5, 6] ['slag', 'ash', 'age', 'water_cement_ratio']
feature_selected_by_fi = set(feature_selected_by_fi)
feature_selected_by_sfm = set(feature_selected_by_sfm)
feature_selected_by_sfs = set(feature_selected_by_sfs)
feature_selected = feature_selected_by_fi.union(feature_selected_by_sfm.union(feature_selected_by_sfs))
print('Features Selected soft selection : ', feature_selected)
Features Selected soft selection : {'ash', 'water_cement_ratio', 'age', 'slag', 'superplastic'}
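These soft-selected features can then be used to build reduced train/validation matrices; a sketch, where sorted() only fixes a deterministic column order:
# Reduced design matrices from the union of the three selection methods.
selected_cols = sorted(feature_selected)
X_train_sel = X_train[selected_cols]
X_val_sel = X_val[selected_cols]
display(X_train_sel.shape, X_val_sel.shape)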
#Helper function to Build & Evaluate the Model
def build_regressor(reg_model, Xs, ys):
    y_pred = reg_model.predict(Xs)
    print('Model Score (R2) of Train set : ', reg_model.score(X_train, y_train).round(3))
    #print('\nOut of Bag Score : ', reg_model.oob_score_.round(3))
    print('\n')
    rmse = np.sqrt(mean_squared_error(ys, y_pred))
    mae = mean_absolute_error(ys, y_pred)
    r2 = r2_score(ys, y_pred)
    print("Model\t\t\t\t RMSE \t\t MAE \t\t R2")
    print("""Regressor on Validation Set \t {:.2f} \t\t{:.2f} \t\t{:.2f}""".format(rmse, mae, r2))
    print('\n')
    # Predicted vs Ground-Truth Plot for the Regressor
    plt.figure(figsize=(8, 6))
    plt.scatter(y_pred, ys, s=20)
    plt.plot([ys.min(), ys.max()], [ys.min(), ys.max()], 'k--', lw=2)
    plt.ylabel("True")
    plt.xlabel("Predicted")
    plt.title("Predicted vs Ground-Truth Plot\n\n")
    plt.show()
    return y_pred, rmse, mae, r2
# Helper function to calculate & plot Cross Validation Score
def plot_cross_val_score(reg_model, Xs, ys, cv=10, alpha=0.95, scoring='r2'):
    model_cv_score = cross_val_score(reg_model, Xs, ys, cv=cv, scoring=scoring)
    model_cv_score_mean = model_cv_score.mean()
    model_cv_score_sd = model_cv_score.std()
    print('Cross validation score (Mean): ', round(model_cv_score_mean, 3).astype(str))
    print('Cross validation score (Std Dev): ', round(model_cv_score_sd, 3).astype(str))
    conf_interval_lower = model_cv_score_mean - 2*model_cv_score_sd
    conf_interval_upper = model_cv_score_mean + 2*model_cv_score_sd
    print('CV Score Mean+-2SD : [', str(conf_interval_lower) + ', ' + str(conf_interval_upper) + ']')
    plt.subplot(211)
    sns.distplot(model_cv_score)
    plt.subplot(212)
    sns.boxplot(model_cv_score)
    plt.suptitle("Cross Validation Score Distribution")
    # Confidence Interval of alpha %
    p = ((1.0 - alpha) / 2.0) * 100
    lower = max(0.0, np.percentile(model_cv_score, p))
    p = (alpha + ((1.0 - alpha) / 2.0)) * 100
    upper = min(1.0, np.percentile(model_cv_score, p))
    print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
    plt.show()
    return model_cv_score
y_pred_rfr, rmse_rfr, mae_rfr, r2_rfr = build_regressor(rf_reg, X_val, y_val)
Model Score (R2) of Train set : 0.982 Model RMSE MAE R2 Regressor on Validation Set 5.43 3.85 0.90
plot_feature_importance(rf_reg)
Amount of variance explained by the predictors
| | feature | importance | cumulative_sum |
|---|---|---|---|
| 5 | age | 0.348 | 34.8 |
| 6 | water_cement_ratio | 0.347 | 69.5 |
| 7 | CLUSTER | 0.109 | 80.4 |
| 0 | slag | 0.063 | 86.7 |
| 2 | superplastic | 0.047 | 91.4 |
| 4 | fineagg | 0.039 | 95.3 |
| 3 | coarseagg | 0.036 | 98.9 |
| 1 | ash | 0.012 | 100.1 |
param_grid_rf = {
    'criterion': ['mse', 'mae'],  # Impurity split criteria
    'n_estimators': [100, 200, 300, 500],
    'max_depth': [3, 4, 5, 6],  # Tree Depth
    'warm_start': [True, False]  # Whether to reuse the previous ensemble when refitting
}
rfr_tuned = GridSearchCV(
RandomForestRegressor(n_jobs=-1, random_state=24, verbose=2),
param_grid=param_grid_rf,
scoring='r2',
cv=KFold(n_splits=10),
n_jobs=-1
)
rfr_tuned.fit(X_train, y_train)
GridSearchCV(cv=KFold(n_splits=10, random_state=None, shuffle=False),
estimator=RandomForestRegressor(n_jobs=-1, random_state=24,
verbose=2),
n_jobs=-1,
param_grid={'criterion': ['mse', 'mae'], 'max_depth': [3, 4, 5, 6],
'n_estimators': [100, 200, 300, 500],
'warm_start': [True, False]},
scoring='r2')
RandomForestRegressor(n_jobs=-1, random_state=24, verbose=2)
rfr_tuned.best_params_
{'criterion': 'mae', 'max_depth': 6, 'n_estimators': 200, 'warm_start': True}
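GridSearchCV refits the best configuration on the whole training set by default, so the tuned estimator and its cross-validated score are available directly; a sketch:
# best_score_ is the mean cross-validated R2 of the best parameter combination.
print('Best CV R2 : ', round(rfr_tuned.best_score_, 3))
best_rf = rfr_tuned.best_estimator_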
y_pred_rfr_tuned, rmse_rfr_tuned, mae_rfr_tuned, r2_rfr_tuned = build_regressor(rfr_tuned, X_val, y_val)
Model Score (R2) of Train set : 0.917 Model RMSE MAE R2 Regressor on Validation Set 6.38 4.91 0.86
rfr_cv_score = plot_cross_val_score(rfr_tuned, X_val, y_val, cv=10, alpha=0.95, scoring = 'r2')
Cross validation score (Mean): 0.773
Cross validation score (Std Dev): 0.077
CV Score Mean +- 2 SD : [0.618, 0.927]
95.0% confidence interval: 64.5% and 87.5%
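`plot_cross_val_score` is a helper defined earlier in the notebook, not a library function. For readers skimming this section, here is a minimal sketch of what it likely does, assuming it wraps sklearn's `cross_val_score` and summarizes the per-fold scores (the function body, plot details, and interval logic below are assumptions, not the notebook's actual code):

from sklearn.model_selection import cross_val_score
import numpy as np
import matplotlib.pyplot as plt

def plot_cross_val_score_sketch(estimator, X, y, cv=10, alpha=0.95, scoring='r2'):
    #Per-fold scores from k-fold cross validation
    scores = cross_val_score(estimator, X, y, cv=cv, scoring=scoring, n_jobs=-1)
    mean, sd = scores.mean(), scores.std()
    #Empirical two-sided (1 - alpha) interval over the fold scores
    lo, hi = np.percentile(scores, [(1 - alpha) / 2 * 100, (1 + alpha) / 2 * 100])
    plt.bar(range(1, len(scores) + 1), scores)
    plt.axhline(mean, linestyle='--', label=f'mean = {mean:.3f}')
    plt.xlabel('Fold'); plt.ylabel(scoring); plt.legend(); plt.show()
    print(f'Cross validation score (Mean): {mean:.3f}')
    print(f'Cross validation score (Std Dev): {sd:.3f}')
    print(f'CV Score Mean +- 2 SD : [{mean - 2 * sd:.3f}, {mean + 2 * sd:.3f}]')
    print(f'{alpha * 100}% confidence interval: {lo:.1%} and {hi:.1%}')
    return scores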
xgbr = xgb.XGBRegressor(
    booster='gbtree',
    random_state=24,
    n_jobs=-1,
    gamma=0,                 #Minimum loss reduction required to make a split
    importance_type='gain',  #Report importances as average gain of the splits that use the feature
    learning_rate=0.1        #Shrinkage parameter
)
xgbr.fit(
    X_train,
    y_train,
    verbose=True,
    early_stopping_rounds=10, #Stop adding boosting rounds when the validation score hasn't improved for 10 rounds
    eval_metric='rmse',
    eval_set=[(X_train, y_train), (X_val, y_val)]
)
[0]	validation_0-rmse:35.50664	validation_1-rmse:35.30619
Multiple eval metrics have been passed: 'validation_1-rmse' will be used for early stopping.
Will train until validation_1-rmse hasn't improved in 10 rounds.
... (per-round RMSE output for rounds 1-98 omitted) ...
[99]	validation_0-rmse:1.73484	validation_1-rmse:4.59894
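The validation RMSE was still improving at round 99, so the 10-round early-stopping rule never fired and all 100 trees were kept. When the rule does fire, the fitted sklearn wrapper records the winning round; a short sketch (attribute names follow the xgboost 1.x sklearn API, and `iteration_range` requires xgboost >= 1.4):

#Best round on the early-stopping set (validation_1) and its RMSE
print('Best iteration:', xgbr.best_iteration)
print('Best val RMSE :', xgbr.best_score)
#Optionally restrict prediction to the trees up to the best round
y_pred_best = xgbr.predict(X_val, iteration_range=(0, xgbr.best_iteration + 1))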
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.1, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=-1, num_parallel_tree=1, random_state=24,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)
y_pred_xgbr, rmse_xgbr, mae_xgbr, r2_xgbr = build_regressor(xgbr, X_val, y_val)
Model Score (Adjusted R2) of Train set : 0.989

| Model | RMSE | MAE | R2 |
|---|---|---|---|
| Regressor on Validation Set | 4.60 | 3.20 | 0.93 |
plot_feature_importance(xgbr)
Share of the model's total feature importance (gain) attributed to each predictor

| | feature | importance | cumulative_sum (%) |
|---|---|---|---|
| 7 | CLUSTER | 0.569 | 56.9 |
| 6 | water_cement_ratio | 0.192 | 76.1 |
| 5 | age | 0.146 | 90.7 |
| 0 | slag | 0.031 | 93.8 |
| 2 | superplastic | 0.027 | 96.5 |
| 3 | coarseagg | 0.013 | 97.8 |
| 4 | fineagg | 0.013 | 99.1 |
| 1 | ash | 0.010 | 100.1 |
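`plot_feature_importance` is likewise a notebook-defined helper. A plausible minimal sketch, assuming it tabulates the fitted model's `feature_importances_` with a cumulative-percentage column like the one above (the helper's name, body, and layout here are assumptions):

import pandas as pd
import matplotlib.pyplot as plt

def plot_feature_importance_sketch(model, feature_names):
    imp = (pd.DataFrame({'feature': feature_names,
                         'importance': model.feature_importances_})
             .sort_values('importance', ascending=False))
    #Cumulative share of total importance, in percent
    imp['cumulative_sum'] = 100 * imp['importance'].cumsum() / imp['importance'].sum()
    imp.plot.barh(x='feature', y='importance', legend=False)
    plt.gca().invert_yaxis()  #Most important feature on top
    plt.xlabel('importance'); plt.show()
    return imp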
param_grid = {
    'n_estimators' : [100, 200, 300, 500],
    'max_depth' : [3, 4, 5, 6],           #Tree depth
    'learning_rate' : [0.1, 0.01, 0.05],  #Shrinkage parameter to reduce overfitting
    'gamma' : [0, 0.25, 1.0],             #Regularization parameter (minimum split loss)
    'reg_lambda' : [0, 1.0, 10.0]         #L2 regularization parameter
}
xgbr_tuned = GridSearchCV(
    xgb.XGBRegressor(booster='gbtree', objective='reg:squarederror'),
    param_grid=param_grid,
    scoring='r2',
    cv=KFold(n_splits=10),
    n_jobs=-1
)
xgbr_tuned.fit(
    X_train,
    y_train,
    verbose=True,
    early_stopping_rounds=10, #Stop adding boosting rounds when the validation score hasn't improved for 10 rounds
    eval_metric='rmse',
    eval_set=[(X_train, y_train), (X_val, y_val)]
)
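Note that `verbose`, `early_stopping_rounds`, `eval_metric`, and `eval_set` are not GridSearchCV options: `GridSearchCV.fit` forwards unrecognized keyword arguments to the underlying estimator's `fit` on every fold and on the final refit, which is why per-round RMSE logs appear below. A sketch of the equivalent explicit form (the dict name is illustrative):

#These kwargs are passed through to XGBRegressor.fit for each candidate,
#so every candidate is trained with early stopping against the same fixed
#(X_val, y_val) hold-out.
fit_params = {
    'verbose': True,
    'early_stopping_rounds': 10,
    'eval_metric': 'rmse',
    'eval_set': [(X_train, y_train), (X_val, y_val)],
}
#xgbr_tuned.fit(X_train, y_train, **fit_params)  #equivalent to the call above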
[0]	validation_0-rmse:35.56903	validation_1-rmse:35.35297
Multiple eval metrics have been passed: 'validation_1-rmse' will be used for early stopping.
Will train until validation_1-rmse hasn't improved in 10 rounds.
... (per-round RMSE output for rounds 1-421 omitted) ...
[422]	validation_0-rmse:2.21862	validation_1-rmse:4.49089
Stopping. Best iteration:
[412]	validation_0-rmse:2.23100	validation_1-rmse:4.48879
GridSearchCV(cv=KFold(n_splits=10, random_state=None, shuffle=False),
             estimator=XGBRegressor(base_score=None, booster='gbtree',
                                    colsample_bylevel=None,
                                    colsample_bynode=None,
                                    colsample_bytree=None, gamma=None,
                                    gpu_id=None, importance_type='gain',
                                    interaction_constraints=None,
                                    learning_rate=None, max_delta_step=None,
                                    max_depth=None, min_child_weight=None,
                                    missing=n...
                                    n_estimators=100, n_jobs=None,
                                    num_parallel_tree=None, random_state=None,
                                    reg_alpha=None, reg_lambda=None,
                                    scale_pos_weight=None, subsample=None,
                                    tree_method=None, validate_parameters=None,
                                    verbosity=None),
             n_jobs=-1,
             param_grid={'gamma': [0, 0.25, 1.0],
                         'learning_rate': [0.1, 0.01, 0.05],
                         'max_depth': [3, 4, 5, 6],
                         'n_estimators': [100, 200, 300, 500],
                         'reg_lambda': [0, 1.0, 10.0]},
             scoring='r2')
xgbr_tuned.best_params_
{'gamma': 0.25,
'learning_rate': 0.1,
'max_depth': 3,
'n_estimators': 500,
'reg_lambda': 1.0}
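Beyond `best_params_`, the fitted search object exposes the winning cross-validated score and the full grid of results; a quick way to sanity-check how close the runners-up were (these are standard GridSearchCV attributes):

print('Best CV R2 :', round(xgbr_tuned.best_score_, 3))
#Top five candidates by mean test R2, with their fold-to-fold spread
cv_results = pd.DataFrame(xgbr_tuned.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score']
print(cv_results.nlargest(5, 'mean_test_score')[cols])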
y_pred_xgbr_tuned, rmse_xgbr_tuned, mae_xgbr_tuned, r2_xgbr_tuned = build_regressor(xgbr_tuned, X_val, y_val)
Model Score (Adjusted R2) of Train set : 0.982

| Model | RMSE | MAE | R2 |
|---|---|---|---|
| Regressor on Validation Set | 4.49 | 3.12 | 0.93 |
xgbr_cv_score = plot_cross_val_score(xgbr_tuned, X_val, y_val, cv=10, alpha=0.95, scoring = 'r2')
Cross validation score (Mean): 0.869
Cross validation score (Std Dev): 0.078
CV Score Mean +- 2 SD : [0.714, 1.024]
95.0% confidence interval: 69.4% and 92.7%
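Since `paired_ttest_5x2cv` was imported at the top of the notebook, the tuned XGBoost can also be compared formally against the earlier tuned random forest rather than by eyeballing CV means. A sketch, assuming `rfr_tuned` is the fitted grid search from the preceding section (passing `best_estimator_` so the test refits plain estimators instead of re-running each grid search):

#5x2cv paired t-test on R2: H0 = the two models perform equally well
t_stat, p_val = paired_ttest_5x2cv(
    estimator1=xgbr_tuned.best_estimator_,
    estimator2=rfr_tuned.best_estimator_,
    X=X_train, y=y_train,
    scoring='r2',
    random_seed=random_state
)
print(f't statistic: {t_stat:.3f}, p value: {p_val:.3f}')
#p < 0.05 would indicate a statistically significant difference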
lgbmr = lgb.LGBMRegressor(
    boosting_type='gbdt',
    max_depth=-1,           #-1 means no depth limit
    learning_rate=0.1,
    n_estimators=100,
    objective='regression',
    reg_alpha=0.0,          #L1 Regularisation
    reg_lambda=0.0,         #L2 Regularisation
    random_state=24,
    n_jobs=-1,
    silent=False,
    importance_type='gain'
)
lgbmr.fit(
    X_train,
    y_train,
    eval_set=[(X_train, y_train), (X_val, y_val)],
    eval_metric='rmse',
    early_stopping_rounds=10,
    verbose=True
)
[1]	training's rmse: 15.3463	training's l2: 235.508	valid_1's rmse: 15.8679	valid_1's l2: 251.789
Training until validation scores don't improve for 10 rounds
... (per-round RMSE/L2 output for rounds 2-99 omitted) ...
[100]	training's rmse: 2.58357	training's l2: 6.67482	valid_1's rmse: 4.54135	valid_1's l2: 20.6239
Did not meet early stopping. Best iteration is:
[100]	training's rmse: 2.58357	training's l2: 6.67482	valid_1's rmse: 4.54135	valid_1's l2: 20.6239
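As with XGBoost, early stopping never fired here (the validation RMSE was still improving at round 100), but the fitted wrapper still records the best round. The attribute names below are from LightGBM's sklearn API:

print('Best iteration:', lgbmr.best_iteration_)  #100 in this run
print('Best scores   :', lgbmr.best_score_)      #dict keyed by eval-set name
#Predict using only the trees up to the best round
y_pred_best = lgbmr.predict(X_val, num_iteration=lgbmr.best_iteration_)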
LGBMRegressor(importance_type='gain', objective='regression', random_state=24,
              silent=False)
y_pred_lgbmr, rmse_lgbmr, mae_lgbmr, r2_lgbmr = build_regressor(lgbmr, X_val, y_val)
Model Score (Adjusted R2) of Train set : 0.976

| Model | RMSE | MAE | R2 |
|---|---|---|---|
| Regressor on Validation Set | 4.54 | 3.20 | 0.93 |
plot_feature_importance(lgbmr)
Share of the model's total feature importance (gain) attributed to each predictor. LightGBM reports raw split gains rather than normalized shares, so unlike the XGBoost table these values do not sum to 1.

| | feature | importance | cumulative_sum |
|---|---|---|---|
| 5 | age | 320707.904 | 32070790.4 |
| 6 | water_cement_ratio | 310961.627 | 63166953.1 |
| 7 | CLUSTER | 78375.261 | 71004479.2 |
| 0 | slag | 65958.125 | 77600291.7 |
| 2 | superplastic | 41181.565 | 81718448.2 |
| 4 | fineagg | 33647.383 | 85083186.5 |
| 3 | coarseagg | 24959.522 | 87579138.7 |
| 1 | ash | 8679.036 | 88447042.3 |
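Because the numbers above are raw gain totals (hence the cumulative column running into the tens of millions), the two models' rankings are easier to compare after normalizing. A sketch, assuming `X_train` is a DataFrame whose columns match the features above:

#Normalize LightGBM's raw gain importances so they sum to 1,
#matching the scale of XGBoost's gain importances.
lgb_imp = pd.Series(lgbmr.feature_importances_, index=X_train.columns)
lgb_imp_norm = (lgb_imp / lgb_imp.sum()).sort_values(ascending=False)
print(lgb_imp_norm.round(3))  #age and water_cement_ratio dominate, as above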
param_grid_lgb = {
    'n_estimators' : [100, 200, 300, 500],
    'max_depth' : [3, 4, 5, 6],           #Tree depth
    'learning_rate' : [0.1, 0.01, 0.05],  #Shrinkage parameter to reduce overfitting
    'reg_alpha' : [0, 1.0, 10.0],         #L1 regularization parameter
    'reg_lambda' : [0, 1.0, 10.0]         #L2 regularization parameter
}
lgbmr_tuned = GridSearchCV(
    lgb.LGBMRegressor(boosting_type='gbdt', objective='regression', random_state=24),
    param_grid=param_grid_lgb,
    scoring='r2',
    cv=KFold(n_splits=10),
    n_jobs=-1
)
lgbmr_tuned.fit(
    X_train,
    y_train,
    eval_set=[(X_train, y_train), (X_val, y_val)],
    eval_metric='rmse',
    early_stopping_rounds=10,
    verbose=True
)
[1]	training's rmse: 15.5969	training's l2: 243.265	valid_1's rmse: 16.151	valid_1's l2: 260.855
Training until validation scores don't improve for 10 rounds
... (per-round RMSE/L2 output for rounds 2-178 omitted) ...
[179]	training's rmse: 2.82869	training's l2: 8.00149	valid_1's rmse: 4.77905	valid_1's l2: 22.8393
[180]
training's rmse: 2.82233 training's l2: 7.96555 valid_1's rmse: 4.77253 valid_1's l2: 22.777 [181] training's rmse: 2.81911 training's l2: 7.94739 valid_1's rmse: 4.77222 valid_1's l2: 22.774 [182] training's rmse: 2.8061 training's l2: 7.8742 valid_1's rmse: 4.77153 valid_1's l2: 22.7675 [183] training's rmse: 2.8031 training's l2: 7.85737 valid_1's rmse: 4.77193 valid_1's l2: 22.7713 [184] training's rmse: 2.79091 training's l2: 7.78919 valid_1's rmse: 4.77387 valid_1's l2: 22.7899 [185] training's rmse: 2.78505 training's l2: 7.7565 valid_1's rmse: 4.76836 valid_1's l2: 22.7372 [186] training's rmse: 2.78126 training's l2: 7.73541 valid_1's rmse: 4.76473 valid_1's l2: 22.7027 [187] training's rmse: 2.77085 training's l2: 7.67763 valid_1's rmse: 4.75757 valid_1's l2: 22.6345 [188] training's rmse: 2.76333 training's l2: 7.63599 valid_1's rmse: 4.75412 valid_1's l2: 22.6017 [189] training's rmse: 2.75173 training's l2: 7.572 valid_1's rmse: 4.7477 valid_1's l2: 22.5406 [190] training's rmse: 2.74858 training's l2: 7.55467 valid_1's rmse: 4.74993 valid_1's l2: 22.5619 [191] training's rmse: 2.73626 training's l2: 7.48712 valid_1's rmse: 4.74814 valid_1's l2: 22.5448 [192] training's rmse: 2.7282 training's l2: 7.44309 valid_1's rmse: 4.74647 valid_1's l2: 22.529 [193] training's rmse: 2.7232 training's l2: 7.41581 valid_1's rmse: 4.74301 valid_1's l2: 22.4962 [194] training's rmse: 2.71846 training's l2: 7.39003 valid_1's rmse: 4.73861 valid_1's l2: 22.4544 [195] training's rmse: 2.70792 training's l2: 7.33285 valid_1's rmse: 4.73605 valid_1's l2: 22.4301 [196] training's rmse: 2.70306 training's l2: 7.30656 valid_1's rmse: 4.73157 valid_1's l2: 22.3877 [197] training's rmse: 2.6936 training's l2: 7.25547 valid_1's rmse: 4.732 valid_1's l2: 22.3918 [198] training's rmse: 2.68548 training's l2: 7.2118 valid_1's rmse: 4.72314 valid_1's l2: 22.3081 [199] training's rmse: 2.67565 training's l2: 7.15909 valid_1's rmse: 4.72864 valid_1's l2: 22.36 [200] training's rmse: 2.66859 training's l2: 7.12136 valid_1's rmse: 4.72044 valid_1's l2: 22.2826 [201] training's rmse: 2.66548 training's l2: 7.10478 valid_1's rmse: 4.7177 valid_1's l2: 22.2567 [202] training's rmse: 2.65729 training's l2: 7.0612 valid_1's rmse: 4.71566 valid_1's l2: 22.2375 [203] training's rmse: 2.65444 training's l2: 7.04608 valid_1's rmse: 4.71541 valid_1's l2: 22.2351 [204] training's rmse: 2.64861 training's l2: 7.01516 valid_1's rmse: 4.7212 valid_1's l2: 22.2897 [205] training's rmse: 2.64584 training's l2: 7.00047 valid_1's rmse: 4.72254 valid_1's l2: 22.3024 [206] training's rmse: 2.64067 training's l2: 6.97316 valid_1's rmse: 4.72826 valid_1's l2: 22.3565 [207] training's rmse: 2.63862 training's l2: 6.96234 valid_1's rmse: 4.72989 valid_1's l2: 22.3719 [208] training's rmse: 2.63125 training's l2: 6.92347 valid_1's rmse: 4.7274 valid_1's l2: 22.3483 [209] training's rmse: 2.62856 training's l2: 6.90933 valid_1's rmse: 4.72394 valid_1's l2: 22.3156 [210] training's rmse: 2.62344 training's l2: 6.88241 valid_1's rmse: 4.72451 valid_1's l2: 22.321 [211] training's rmse: 2.62109 training's l2: 6.87013 valid_1's rmse: 4.72221 valid_1's l2: 22.2993 [212] training's rmse: 2.61627 training's l2: 6.84488 valid_1's rmse: 4.71624 valid_1's l2: 22.2429 [213] training's rmse: 2.61404 training's l2: 6.83319 valid_1's rmse: 4.71558 valid_1's l2: 22.2367 Early stopping, best iteration is: [203] training's rmse: 2.65444 training's l2: 7.04608 valid_1's rmse: 4.71541 valid_1's l2: 22.2351
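The log above is produced by passing an evaluation set at fit time so that boosting halts once validation RMSE stops improving. A minimal sketch of such a call; the early_stopping_rounds value of 10 is an assumption inferred from the log (best iteration [203], last iteration [213]):
# Sketch: fit LightGBM with per-iteration evaluation and early stopping.
# NOTE: early_stopping_rounds=10 is an assumption, not taken from the notebook.
lgbm_model = lgb.LGBMRegressor(objective='regression', random_state=24)
lgbm_model.fit(X_train, y_train,
               eval_set=[(X_train, y_train), (X_val, y_val)],
               eval_metric='rmse',
               early_stopping_rounds=10)  # LightGBM >= 4.0 uses callbacks=[lgb.early_stopping(10)] instead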
GridSearchCV(cv=KFold(n_splits=10, random_state=None, shuffle=False),
estimator=LGBMRegressor(objective='regression', random_state=24),
n_jobs=-1,
param_grid={'learning_rate': [0.1, 0.01, 0.05],
'max_depth': [3, 4, 5, 6],
'n_estimators': [100, 200, 300, 500],
'reg_alpha': [0, 1.0, 10.0],
'reg_lambda': [0, 1.0, 10.0]},
scoring='r2')
lgbmr_tuned.best_params_
{'learning_rate': 0.1,
'max_depth': 6,
'n_estimators': 500,
'reg_alpha': 0,
'reg_lambda': 10.0}
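The winning configuration is available directly as lgbmr_tuned.best_estimator_ (already refit on the full training data); alternatively it can be reconstructed from the parameters above. A minimal sketch:
# Rebuild the tuned LightGBM regressor from the best hyperparameters found
lgbmr_best = lgb.LGBMRegressor(objective='regression', random_state=24,
                               **lgbmr_tuned.best_params_)
lgbmr_best.fit(X_train, y_train)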
y_pred_lgbmr_tuned, rmse_lgbmr_tuned, mae_lgbmr_tuned, r2_lgbmr_tuned = build_regressor(lgbmr_tuned, X_val, y_val)
Model Score (Adjusted R2) of Train set : 0.975

| Model | RMSE | MAE | R2 |
|---|---|---|---|
| Regressor on Validation Set | 4.72 | 3.37 | 0.92 |
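The "Adjusted R2" reported by the build_regressor helper (defined earlier in the notebook) presumably follows the standard definition, which discounts R2 for the number of predictors $p$ relative to the $n$ training observations:

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$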
lgbmr_cv_score = plot_cross_val_score(lgbmr_tuned, X_val, y_val, cv=10, alpha=0.95, scoring = 'r2')
Cross validation score (Mean): 0.814
Cross validation score (Std Dev): 0.084
CV Score Mean ± 2SD : [0.645, 0.982]
95.0% confidence interval: 65.9% and 90.6%
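plot_cross_val_score is a helper defined earlier in the notebook; a minimal sketch of the statistics it reports (the plotting is omitted, and the mean ± 2SD interval construction is an assumption based on the printed output):
# Sketch: cross-validated R2 scores with a mean ± 2SD interval
scores = cross_val_score(lgbmr_tuned, X_val, y_val, cv=10, scoring='r2')
mean, sd = scores.mean(), scores.std()
print(f'Cross validation score (Mean): {mean:.3f}')
print(f'Cross validation score (Std Dev): {sd:.3f}')
print(f'CV Score Mean ± 2SD : [{mean - 2*sd:.3f}, {mean + 2*sd:.3f}]')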
def plot_learning_curve(estimator, X, y, ax, ylim = None, cv = None, n_jobs = 1,
                        train_sizes = np.linspace(.1, 1.0, 5), name = 'Learning Curve'):
    if ylim is not None:
        ax.set_ylim(*ylim)  # set limits on the target axes (plt.ylim would hit the current axes instead)
    # Compute train/validation scores at increasing training-set sizes
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv = cv, n_jobs = n_jobs,
                                                            train_sizes = train_sizes)
    train_scores_mean = np.mean(train_scores, axis = 1)
    train_scores_std = np.std(train_scores, axis = 1)
    test_scores_mean = np.mean(test_scores, axis = 1)
    test_scores_std = np.std(test_scores, axis = 1)
    # Shade one standard deviation around each mean curve
    ax.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std,
                    alpha = 0.1, color = '#ff9124')
    ax.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std,
                    alpha = 0.1, color = '#2492ff')
    ax.plot(train_sizes, train_scores_mean, 'o-', color = '#ff9124', label = 'Training score')
    ax.plot(train_sizes, test_scores_mean, 'o-', color = '#2492ff', label = 'Cross-validation score')
    ax.set_title(name, fontsize = 14)
    ax.set_xlabel('Training size')
    ax.set_ylabel('Score')
    ax.grid(True)
    ax.legend(loc = 'best')
# Plot training vs cross validation scores
cv = KFold(n_splits = 30, shuffle = True, random_state = random_state)  # shuffle=True is required when random_state is set
f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize = (15, 8))
f.suptitle('Training vs Cross Validation Scores', fontsize = 14)
plot_learning_curve(rf_reg, X_train, y_train, cv = cv, n_jobs = -1, ax = ax1,
                    name = 'RF Regressor')
plot_learning_curve(xgbr, X_train, y_train, cv = cv, n_jobs = -1, ax = ax2,
                    name = 'XGB Regressor')
plot_learning_curve(lgbmr, X_train, y_train, cv = cv, n_jobs = 1, ax = ax3,
                    name = 'LGBM Regressor')
Observations:
plt.figure(figsize=(10,6))
plt.boxplot([xgbr_cv_score, lgbmr_cv_score],
showmeans=True,
labels=['XGB CV','LGBM CV'])
plt.title('Cross Val Score - XGB vs LGBM')
plt.show()
metrics = {'Regressor':['Linear Regression',
'Linear Regression - Polynomial Features',
'Support Vector Machine RBF Kernel',
'Decision Tree Regressor',
'Random Forest Regressor',
'Random Forest Regressor - Tuned',
'XGBoost Regressor',
'XGB Regressor - Tuned',
'LightGBM Regressor',
'LightGBM Regressor - Tuned'],
'RMSE' : [rmse_lr, rmse_poly_lr, rmse_svr, rmse_dtr, rmse_rfr, rmse_rfr_tuned, rmse_xgbr, rmse_xgbr_tuned, rmse_lgbmr, rmse_lgbmr_tuned],
'MAE' : [mae_lr, mae_poly_lr, mae_svr, mae_dtr, mae_rfr, mae_rfr_tuned, mae_xgbr, mae_xgbr_tuned, mae_lgbmr, mae_lgbmr_tuned],
'R2' : [r2_lr, r2_poly_lr, r2_svr, r2_dtr, r2_rfr, r2_rfr_tuned, r2_xgbr, r2_xgbr_tuned, r2_lgbmr, r2_lgbmr_tuned]
}
model_eval_metrics = pd.DataFrame(metrics)
model_eval_metrics = model_eval_metrics.set_index('Regressor')
model_eval_metrics
| Regressor | RMSE | MAE | R2 |
|---|---|---|---|
| Linear Regression | 10.884346 | 8.732527 | 0.600217 |
| Linear Regression - Polynomial Features | 7.569204 | 5.878866 | 0.806661 |
| Support Vector Machine RBF Kernel | 10.436451 | 8.472344 | 0.632443 |
| Decision Tree Regressor | 6.161474 | 4.169757 | 0.871888 |
| Random Forest Regressor | 5.430297 | 3.848853 | 0.900490 |
| XGBoost Regressor | 4.598941 | 3.200675 | 0.928627 |
| XGB Regressor - Tuned | 4.488788 | 3.115504 | 0.932005 |
| LightGBM Regressor | 4.541353 | 3.199917 | 0.930403 |
# check if difference between algorithms is real
t, p = paired_ttest_5x2cv(estimator1=xgbr,
estimator2=lgbmr,
X=X_val,
y=y_val,
scoring='r2',
random_seed=24)
print(f'The p-value is {p:.3f}')
print(f'The t-statistic is {t:.3f}')
# interpret the result
if p <= 0.05:
    print('Since p<=0.05, we can reject the null hypothesis that both models perform equally well on this dataset.\nWe may conclude that the two algorithms are significantly different.')
else:
    print('Since p>0.05, we cannot reject the null hypothesis.\nWe may conclude that the performance of the two algorithms is not significantly different.')
The p-value is 0.544
The t-statistic is -0.651
Since p>0.05, we cannot reject the null hypothesis.
We may conclude that the performance of the two algorithms is not significantly different.
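For reference, paired_ttest_5x2cv implements Dietterich's 5x2cv paired t-test: the data are split five times into two folds, both models are scored on each fold, and with score differences $p_i^{(j)}$ (replication $i$, fold $j$) the statistic is

$$t = \frac{p_1^{(1)}}{\sqrt{\frac{1}{5}\sum_{i=1}^{5} s_i^2}}, \qquad s_i^2 = \big(p_i^{(1)} - \bar{p}_i\big)^2 + \big(p_i^{(2)} - \bar{p}_i\big)^2, \qquad \bar{p}_i = \frac{p_i^{(1)} + p_i^{(2)}}{2},$$

which is compared against a t-distribution with 5 degrees of freedom.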
y_pred_rfr_tuned_test, rmse_rfr_tuned_test, mae_rfr_tuned_test, r2_rfr_tuned_test = build_regressor(rfr_tuned, X_test, y_test)
y_pred_xgbr_tuned_test, rmse_xgbr_tuned_test, mae_xgbr_tuned_test, r2_xgbr_tuned_test = build_regressor(xgbr_tuned, X_test, y_test)
y_pred_lgbmr_tuned_test, rmse_lgbmr_tuned_test, mae_lgbmr_tuned_test, r2_lgbmr_tuned_test = build_regressor(lgbmr_tuned, X_test, y_test)
Model Score (Adjusted R2) of Train set : 0.917

| Model | RMSE | MAE | R2 |
|---|---|---|---|
| Regressor on Validation Set | 6.50 | 4.99 | 0.84 |
Model Score (Adjusted R2) of Train set : 0.982

| Model | RMSE | MAE | R2 |
|---|---|---|---|
| Regressor on Validation Set | 4.57 | 3.18 | 0.92 |

Model Score (Adjusted R2) of Train set : 0.975

| Model | RMSE | MAE | R2 |
|---|---|---|---|
| Regressor on Validation Set | 4.69 | 3.18 | 0.92 |
metrics_test = {'Regressor on Test set':['Random Forest Regressor',
'XGBoost Regressor',
'LightGBM Regressor'],
'RMSE' : [rmse_rfr_tuned_test, rmse_xgbr_tuned_test, rmse_lgbmr_tuned_test],
'MAE' : [mae_rfr_tuned_test, mae_xgbr_tuned_test, mae_lgbmr_tuned_test],
'R2' : [r2_rfr_tuned_test, r2_xgbr_tuned_test, r2_lgbmr_tuned_test]
}
model_eval_metrics_test = pd.DataFrame(metrics_test)
model_eval_metrics_test = model_eval_metrics_test.set_index('Regressor on Test set')
model_eval_metrics_test
| Regressor on Test set | RMSE | MAE | R2 |
|---|---|---|---|
| Random Forest Regressor | 6.497063 | 4.994276 | 0.838635 |
| XGBoost Regressor | 4.574537 | 3.180451 | 0.920004 |
| LightGBM Regressor | 4.691549 | 3.178575 | 0.915859 |
# Predicted vs Ground-Truth Plot for all the Models
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16,8))
ax1.scatter(y_pred_rfr_tuned_test, y_test, s=20)
ax1.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
ax1.set_ylabel("True")
ax1.set_xlabel("Predicted")
ax1.set_title("RF Regressor")
ax2.scatter(y_pred_xgbr_tuned_test, y_test, s=20)
ax2.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
ax2.set_ylabel("True")
ax2.set_xlabel("Predicted")
ax2.set_title("XGB Regressor")
ax3.scatter(y_pred_lgbmr_tuned_test, y_test, s=20)
ax3.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
ax3.set_ylabel("True")
ax3.set_xlabel("Predicted")
ax3.set_title("LGBM Regressor")
fig.suptitle("Predicted vs Ground-Truth Plot\n\n")
fig.tight_layout(rect=[0, 0.03, 1, 0.95])
sns.heatmap(model_eval_metrics_test, annot=True, linewidths=0.3, fmt='.3f', square=True, cmap='YlGnBu_r').set_title('Model Evaluation on Test set');
In line with its cross-validation scores, the XGB Regressor is also the best performer on the unseen test set and is recommended as the best of the algorithms evaluated.
Let us validate this finding, that XGB is the better-performing model, using scikit-learn Pipelines and GridSearchCV.
# Instantiate a pipeline
pipe = Pipeline([("regressor", RandomForestRegressor())])
# Create parameter grid with learning algorithms and their hyperparameters
grid_param = [
{
'regressor': [RandomForestRegressor(n_jobs=-1, random_state=24, verbose=2)],
'regressor__criterion': ['mse','mae'],
'regressor__n_estimators': [100,200,300,500],
'regressor__max_depth': [3,6,9]
},
{
'regressor': [xgb.XGBRegressor(booster='gbtree', objective='reg:squarederror')],
'regressor__n_estimators': [100,200,300,500],
'regressor__max_depth': [3,6,9],
'regressor__learning_rate' : [0.1,0.01,0.05],
'regressor__gamma' : [0,0.25,1.0],
'regressor__reg_lambda' : [0,1.0,10.0]
},
{
'regressor': [lgb.LGBMRegressor(boosting_type='gbdt', objective='regression', random_state=24)],
# LGBMRegressor has no 'criterion' hyperparameter, so it is not tuned here
'regressor__n_estimators': [100,200,300,500],
'regressor__max_depth': [3,6,9],
'regressor__learning_rate' : [0.1,0.01,0.05],
'regressor__reg_alpha' : [0,0.25,1.0],
'regressor__reg_lambda' : [0,1.0,10.0]
}
]
# Creating GridSearch of multiple models and fitting the Best Model
gridsearch = GridSearchCV(pipe, grid_param, scoring='r2', cv=KFold(n_splits=10), verbose=0, n_jobs=-1)
best_model = gridsearch.fit(X_train,y_train)
print(best_model.best_estimator_)
Pipeline(steps=[('regressor',
XGBRegressor(base_score=0.5, booster='gbtree',
colsample_bylevel=1, colsample_bynode=1,
colsample_bytree=1, gamma=0.25, gpu_id=-1,
importance_type='gain',
interaction_constraints='', learning_rate=0.1,
max_delta_step=0, max_depth=3, min_child_weight=1,
missing=nan, monotone_constraints='()',
n_estimators=500, n_jobs=0, num_parallel_tree=1,
random_state=0, reg_alpha=0, reg_lambda=1.0,
scale_pos_weight=1, subsample=1,
tree_method='exact', validate_parameters=1,
verbosity=None))])
print(best_model.best_params_)
{'regressor': XGBRegressor(base_score=None, booster='gbtree', colsample_bylevel=None,
colsample_bynode=None, colsample_bytree=None, gamma=0.25,
gpu_id=None, importance_type='gain', interaction_constraints=None,
learning_rate=0.1, max_delta_step=None, max_depth=3,
min_child_weight=None, missing=nan, monotone_constraints=None,
n_estimators=500, n_jobs=None, num_parallel_tree=None,
random_state=None, reg_alpha=None, reg_lambda=1.0,
scale_pos_weight=None, subsample=None, tree_method=None,
validate_parameters=None, verbosity=None), 'regressor__gamma': 0.25, 'regressor__learning_rate': 0.1, 'regressor__max_depth': 3, 'regressor__n_estimators': 500, 'regressor__reg_lambda': 1.0}
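Beyond best_params_, the fitted search also exposes the cross-validated score of the winning configuration and the full results grid; a quick way to inspect how the three model families compare (best_score_ and cv_results_ are standard GridSearchCV attributes):
# Cross-validated R2 of the winning configuration
print(f'Best CV R2: {best_model.best_score_:.3f}')
# Top-ranked configurations across all three model families
cv_results = pd.DataFrame(best_model.cv_results_)
print(cv_results.sort_values('rank_test_score')[['param_regressor', 'mean_test_score']].head())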
As expected, the XGB Regressor is the best model among RF, XGB and LGBM for the given dataset.
y_pred_best_val, rmse_best_tuned_val, mae_best_tuned_val, r2_best_tuned_val = build_regressor(best_model, X_val, y_val)
Model Score (Adjusted R2) of Train set : 0.984

| Model | RMSE | MAE | R2 |
|---|---|---|---|
| Regressor on Validation Set | 4.46 | 3.10 | 0.93 |
best_cv_score = plot_cross_val_score(best_model, X_val, y_val, cv=10, alpha=0.95, scoring = 'r2')
Cross validation score (Mean): 0.875
Cross validation score (Std Dev): 0.054
CV Score Mean ± 2SD : [0.768, 0.983]
95.0% confidence interval: 76.1% and 92.7%
y_pred_best_test, rmse_best_tuned_test, mae_best_tuned_test, r2_best_tuned_test = build_regressor(best_model, X_test, y_test)
Model Score (Adjusted R2) of Train set : 0.984

| Model | RMSE | MAE | R2 |
|---|---|---|---|
| Regressor on Validation Set | 4.51 | 3.12 | 0.92 |
Statistical Summary and Initial EDA:
- Except for age, all predictor features are on the same scale, measured in kg/m^3. age is measured in number of days, whereas strength is measured in MPa.
- slag and ash have skewed distributions, with no values within the 25% and 50% quantiles. Their distributions are also sparse: mean values of 73 and 54 against standard deviations of 86 and 63, respectively. SD > Mean indicates the variance is skewed towards one of the tails.
- superplastic also has a skewed distribution, with no values in the 1st and 2nd quartiles.
- water appears well distributed, with values in all the quantiles and across the min-max range.
- age is a discrete variable with values ranging from 1 to 365 days (at most one year). Hence scaling is required prior to model building.
- strength is also well distributed, with sufficient representation in all the quantiles.

Univariate Analysis:
- For the cement, water, ash and superplastic variables, clustering / Gaussian Mixture models will be helpful to analyse and understand more about them.
- The strength variable is almost normally distributed.

Multivariate Analysis:
- The strength variable has a positive correlation with cement and very weak or no correlation with the other predictor variables.

Feature Engineering:
Outlier Treatment:
- A derived water_cement_ratio feature is included in the dataset in place of the water and cement features.
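A minimal sketch of how such a feature could be derived (the column name water_cement_ratio comes from the text; computing it as the standard water-to-cement ratio is an assumption about the transformation used earlier):
# Replace water and cement with their ratio, a standard predictor of concrete strength
concrete_fe = concrete.copy()
concrete_fe['water_cement_ratio'] = concrete_fe['water'] / concrete_fe['cement']
concrete_fe = concrete_fe.drop(columns=['water', 'cement'])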
Unsupervised Learning Methods for EDA & Featurisation:
Feature Importance & Feature Selection:
Baseline Model Building & Deciding on Model Complexity:
Model Building - with Non-Parametric Models:
Model Evaluation & Selection:
Model Pipeline & GridSearch